Abstract. Object detection is one of the key tasks in computer vision.
The cascade framework of Viola and Jones has become the de facto
standard. A classifier in each node of the cascade is required to achieve
an extremely high detection rate, rather than a low overall classification
error. Although a few reported methods address this requirement in the
context of object detection, there is no principled feature selection
method that explicitly takes this asymmetric node learning objective into
account. We provide such a boosting algorithm in this work. It is inspired
by the linear asymmetric classifier (LAC) of [1] in that our boosting
algorithm optimizes a similar cost function. The new totally-corrective
boosting algorithm is implemented by the column generation technique in
convex optimization. Experimental results on face detection suggest that
our proposed boosting algorithms can improve the state-of-the-art methods
in detection performance.



1 Introduction 



Real-time detection of various categories of objects in images is one of the key
tasks in computer vision. This topic has been extensively studied in the past
few years due to its important applications in surveillance, intelligent video
analysis, etc. Viola and Jones proposed the first real-time face detector [2,3].
To date, it is still considered one of the state-of-the-art methods, and their
framework is the basis of much subsequent incremental work. Object detection is
a highly asymmetric classification problem, with an exhaustive scanning-window
search being used to locate the target in an image. Only a few of the millions
of scanned patches are true target objects. Cascade classifiers, which take this
asymmetric structure into consideration, have been proposed for efficient
detection. Under the assumption that each node of the cascade classifier makes
independent classification errors, the detection rate and false positive rate of
the entire cascade are $F_{dr} = \prod_{t=1}^{N} d_t$ and
$F_{fp} = \prod_{t=1}^{N} f_t$, respectively. As pointed out in [2,1],
these two equations suggest a node learning objective: each node should have
an extremely high detection rate $d_t$ (e.g., 99.7%) and a moderate false
positive rate $f_t$ (e.g., 50%). With the above values of $d_t$ and $f_t$, and
assuming that the cascade has $N = 20$ nodes, $F_{dr} \approx 94\%$ and
$F_{fp} \approx 10^{-6}$, which is usually the design goal.
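The arithmetic behind these design numbers can be checked with a few lines of Python. This is only an illustrative sketch: the function name `cascade_rates` is hypothetical, and it assumes that all nodes share the same $d_t$ and $f_t$ and err independently, as stated above.

```python
def cascade_rates(d_t, f_t, n_nodes):
    """Overall detection / false positive rates of a cascade whose nodes
    all have the same rates and make independent errors."""
    return d_t ** n_nodes, f_t ** n_nodes

# Node-level targets quoted in the text: 99.7% detection, 50% false positives.
F_dr, F_fp = cascade_rates(d_t=0.997, f_t=0.5, n_nodes=20)
print(f"F_dr = {F_dr:.4f}, F_fp = {F_fp:.2e}")
```

Even a small drop in the per-node detection rate compounds over 20 nodes, which is why the node learning goal is so asymmetric.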

A drawback of standard boosting like AdaBoost is that it does not take 
advantage of the cascade classifier. AdaBoost only minimizes the overall clas- 
sification error and does not minimize the number of false negatives. In this 
sense, the features selected are not optimal for the purpose of rejecting negative 
examples. At the feature selection and classifier training level, Viola and Jones
leveraged the asymmetry property, to some extent, by replacing AdaBoost with
AsymBoost [3]. AsymBoost incurs more loss for misclassifying a positive example
by simply modifying AdaBoost's exponential loss. Better detection rates were
observed over the standard AdaBoost. Nevertheless, AsymBoost addresses the 
node learning goal indirectly and still may not be the optimal solution. Wu et al. 
explicitly studied the node learning goal and they proposed to use linear asym- 
metric classifier (LAC) and Fisher linear discriminant analysis (LDA) to adjust 
the linear coefficients of the selected weak classifiers [1,4]. Their experiments 
indicated that with this post-processing technique, the node learning objective 
can be better met, which is translated into improved detection rates. In Viola 
and Jones' framework, boosting is used to select features and at the same time to 
train a strong classifier. Wu et al.'s work separates these two tasks: they still use 
AdaBoost or AsymBoost to select features; and at the second step, they build 
a strong classifier using LAC or LDA. Because of this two-step design, in Wu
et al.'s work [1,4] the node learning objective is only considered at the second
step. At the first step (feature selection), the node learning objective is not
explicitly considered. We conjecture that further improvement may be gained if
the node learning objective is explicitly taken into account at both steps. We
design new boosting algorithms to implement this idea and verify this
conjecture. Our major contributions are as follows.

1. We develop new boosting-like algorithms by directly minimizing the objective
function of the linear asymmetric classifier; the resulting algorithm is
termed LACBoost (and FisherBoost when derived from Fisher LDA). Both can be
used to select features that are optimal for achieving the node learning
goal in training a cascade classifier. To our knowledge, this is the first
attempt to design such a feature selection method.

2. LACBoost and FisherBoost share similarities with LPBoost [5] in the sense
that both use column generation, a technique originally proposed for
large-scale linear programming (LP). Typically, the Lagrange dual problem is
solved at each iteration in column generation. We instead solve the primal
quadratic programming (QP) problem, which has a special structure, so that
entropic gradient (EG) can be used to solve it very efficiently. Compared
with general interior-point based QP solvers, EG is much faster. Considering
that one needs to solve QP problems a few thousand times for training a
complete cascade detector, the efficiency improvement is enormous. The time
needed for training with LACBoost (or FisherBoost) is comparable to that of
training an AdaBoost based cascade detector. This is because in both cases
the majority of the time is spent on weak classifier training and
bootstrapping.

3. We apply LACBoost and FisherBoost to face detection and better perfor- 
mances are observed over the state-of-the-art methods [1,4]. The results 
confirm our conjecture and show the effectiveness of LACBoost and Fisher- 
Boost. LACBoost can be immediately applied to other asymmetric classifi- 
cation problems. 

4. We also analyze the condition under which LAC is valid, and show that
the multi-exit cascade might be more suitable for applying the LAC learning
of [1,4] (and our LACBoost) than the standard Viola-Jones cascade.

Besides these, the LACBoost/FisherBoost algorithm differs from traditional 
boosting algorithms in that LACBoost/FisherBoost does not minimize a loss 
function. This opens new possibilities for designing new boosting algorithms for 
special purposes. We have also extended column generation for optimizing non- 
linear optimization problems. Next we review some related work that is closest 
to ours. 

Related work. There is a large body of previous work in object detection
[6, 7]; of particular relevance to our work is boosting-based object detection
originating from Viola and Jones' framework. There are three important
components that make Viola and Jones' framework tremendously successful: (1) the
cascade classifier, which efficiently filters out most negative patches in early
nodes and also helps the final classifier achieve a very high detection rate;
(2) AdaBoost, which selects informative features and at the same time trains a
strong classifier; (3) the use of integral images, which makes the computation
of Haar features extremely fast. Most later work improves one or more of these
three components. In terms of the cascade classifier, a few different approaches
have been proposed, such as soft cascade [8], dynamic cascade [9], and
multi-exit cascade [10]. We have used the multi-exit cascade in this work. The
multi-exit cascade tries to improve the classification performance by using all
the selected weak classifiers for each node. So the $n$-th strong classifier
(node) uses all the weak classifiers in this node as well as those in the
previous $n - 1$ nodes. We show that the LAC post-processing can enhance the
multi-exit cascade. More importantly, we show that the multi-exit cascade better
meets LAC's requirement that the data be Gaussian distributed.

The second research topic is the learning algorithm for constructing a clas- 
sifier. Wu et al. use fast forward feature selection to accelerate the training 
procedure [7]. They have also proposed LAC to learn a better strong classifier 
[1]. Pham and Cham recently proposed online asymmetric boosting with con- 
siderable improvement in training time [6]. By exploiting the feature statistics, 
they have also designed a fast method to train weak classifiers [11]. Li et al. 
advocated FloatBoost to discard some redundant weak classifiers during Ad- 
aBoost's greedy selection procedure [12]. Liu and Shum proposed KLBoost to 



4 C. Shen, P. Wang, and H. Li 

select features and train a strong classifier [13]. Other variants of boosting have
been applied to detection. 

Notation. The following notation is used. A matrix is denoted by a bold
upper-case letter ($\mathbf{X}$); a column vector is denoted by a bold
lower-case letter ($\mathbf{x}$). The $i$-th row of $\mathbf{X}$ is denoted by
$\mathbf{X}_{i:}$ and the $i$-th column by $\mathbf{X}_{:i}$. The identity
matrix is $\mathbf{I}$ and its size should be clear from the context.
$\mathbf{1}$ and $\mathbf{0}$ are column vectors of 1's and 0's, respectively.
We use $\succcurlyeq$, $\preccurlyeq$ to denote component-wise inequalities.

Let $\{(\mathbf{x}_i, y_i)\}_{i=1,\dots,m}$ be the set of training data, where
$\mathbf{x}_i \in \mathcal{X}$ and $y_i \in \{-1,+1\}$, $\forall i$. The
training set consists of $m_1$ positive training points and $m_2$ negative ones;
$m_1 + m_2 = m$. Let $h(\cdot) \in \mathcal{H}$ be a weak classifier that
projects an input vector $\mathbf{x}$ into $\{-1,+1\}$. Here we only consider
discrete classifier outputs. We assume that the set $\mathcal{H}$ is finite and
we have $n$ possible weak classifiers. Let the matrix
$\mathbf{H} \in \mathbb{R}^{m \times n}$, where the $(i,j)$ entry of
$\mathbf{H}$ is $H_{ij} = h_j(\mathbf{x}_i)$. $H_{ij}$ is the label predicted by
weak classifier $h_j(\cdot)$ on the training datum $\mathbf{x}_i$. We define a
matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ such that its $(i,j)$ entry is
$A_{ij} = y_i h_j(\mathbf{x}_i)$.
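To make the notation concrete, the following sketch builds $\mathbf{H}$ and $\mathbf{A}$ for a toy problem. The helper `stump`, the data points and the pool of three weak classifiers are all hypothetical choices made for illustration; the paper's experiments use decision stumps on Haar-like features.

```python
def stump(feature, threshold, polarity=1):
    """Hypothetical decision stump h(x) in {-1, +1} acting on one feature."""
    def h(x):
        return polarity if x[feature] > threshold else -polarity
    return h

# Toy training set: m = 4 two-dimensional points with labels in {-1, +1}.
X = [(0.9, 0.2), (0.8, 0.7), (0.3, 0.6), (0.1, 0.1)]
y = [+1, +1, -1, -1]

# A small finite pool of weak classifiers (n = 3).
pool = [stump(0, 0.5), stump(1, 0.5), stump(0, 0.2, polarity=-1)]

# H[i][j] = h_j(x_i): label predicted by weak classifier j on datum i.
H = [[h(x) for h in pool] for x in X]
# A[i][j] = y_i * h_j(x_i): +1 if classifier j is correct on datum i, else -1.
A = [[y_i * H_ij for H_ij in row] for y_i, row in zip(y, H)]
```

An entry of `A` is +1 exactly when the corresponding weak classifier labels the example correctly, so the first stump, which separates this toy set perfectly, yields a column of ones.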

2 Linear Asymmetric Classification 

Before we propose our LACBoost and FisherBoost, we briefly overview the con- 
cept of LAC. Wu et al. [4] have proposed linear asymmetric classification (LAC) 
as a post-processing step for training nodes in the cascade framework. LAC is 
guaranteed to get an optimal solution under the assumption of Gaussian data 
distributions. 

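Suppose that we have a linear classifier
$f(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^\top \mathbf{x} - b)$, and that we
want to find a pair $\{\mathbf{w}, b\}$ with a very high accuracy on the
positive data $\mathbf{x}_1$ and a moderate accuracy on the negative data
$\mathbf{x}_2$. This is expressed as the following problem:

$$\max_{\mathbf{w} \neq \mathbf{0},\, b} \; \Pr_{\mathbf{x}_1 \sim (\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)} \{\mathbf{w}^\top \mathbf{x}_1 \geq b\}, \quad \text{s.t.} \; \Pr_{\mathbf{x}_2 \sim (\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)} \{\mathbf{w}^\top \mathbf{x}_2 \leq b\} = \lambda, \tag{1}$$

where $\mathbf{x} \sim (\boldsymbol{\mu}, \boldsymbol{\Sigma})$ denotes a
symmetric distribution with mean $\boldsymbol{\mu}$ and covariance
$\boldsymbol{\Sigma}$. If we prescribe $\lambda$ to 0.5 and assume that for any
$\mathbf{w}$, $\mathbf{w}^\top \mathbf{x}_1$ is Gaussian and
$\mathbf{w}^\top \mathbf{x}_2$ is symmetric, then (1) can be approximated by

$$\max_{\mathbf{w} \neq \mathbf{0}} \; \frac{\mathbf{w}^\top (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)}{\sqrt{\mathbf{w}^\top \boldsymbol{\Sigma}_1 \mathbf{w}}}. \tag{2}$$

(2) is similar to LDA's optimization problem

$$\max_{\mathbf{w} \neq \mathbf{0}} \; \frac{\mathbf{w}^\top (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)}{\sqrt{\mathbf{w}^\top (\boldsymbol{\Sigma}_1 + \boldsymbol{\Sigma}_2) \mathbf{w}}}. \tag{3}$$

(2) can be solved by eigen-decomposition and a closed-form solution can be
derived:

$$\mathbf{w}^* = \boldsymbol{\Sigma}_1^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2), \qquad b^* = \mathbf{w}^{*\top} \boldsymbol{\mu}_2. \tag{4}$$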
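As a small illustration of the closed-form LAC solution, $\mathbf{w}^* = \boldsymbol{\Sigma}_1^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$ and $b^* = \mathbf{w}^{*\top}\boldsymbol{\mu}_2$, the sketch below evaluates it on hand-made 2-D data. The function names and the data are hypothetical, and the unbiased sample covariance is assumed as the estimate of $\boldsymbol{\Sigma}_1$.

```python
def mean(points):
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(2)]

def covariance(points):
    """Unbiased 2x2 sample covariance matrix."""
    mu, n = mean(points), len(points)
    c = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - mu[0], p[1] - mu[1]]
        for a in range(2):
            for b in range(2):
                c[a][b] += d[a] * d[b] / (n - 1)
    return c

def lac_2d(pos, neg):
    """LAC closed form: w* = Sigma_1^{-1} (mu_1 - mu_2), b* = w*^T mu_2."""
    mu1, mu2, s = mean(pos), mean(neg), covariance(pos)
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    diff = [mu1[0] - mu2[0], mu1[1] - mu2[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    b = w[0] * mu2[0] + w[1] * mu2[1]
    return w, b

pos = [(2.0, 2.1), (2.5, 1.8), (1.8, 2.6), (2.2, 2.4)]
neg = [(0.0, 0.3), (0.4, -0.2), (-0.3, 0.1), (0.1, 0.6), (1.0, 0.2)]
w, b = lac_2d(pos, neg)
```

The threshold `b` is the projection of the negative-class mean, which is what a target of $\lambda = 0.5$ on the negatives corresponds to when their projection is symmetric about its mean.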
On the other hand, each node in cascaded boosting classifiers has the following
form:

$$f(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^\top \mathbf{H}(\mathbf{x}) - b). \tag{5}$$

We override the symbol $\mathbf{H}(\mathbf{x})$ here, which denotes the output
vector of all weak classifiers over the datum $\mathbf{x}$. We can cast each
node as a linear classifier over the feature space constructed by the binary
outputs of all weak classifiers. For each node in a cascade classifier, we wish
to make the detection rate as high as possible while keeping the false positive
rate at a moderate level (e.g., 50.0%). That is to say, problem (1) expresses
the node learning goal. Therefore, we can use boosting algorithms (e.g.,
AdaBoost) as feature selection methods, and then use LAC to learn a linear
classifier over those binary features chosen by boosting. The advantage is that
LAC considers the asymmetric node learning explicitly.

However, there is a precondition for LAC's validity: for any $\mathbf{w}$,
$\mathbf{w}^\top \mathbf{x}_1$ is Gaussian and $\mathbf{w}^\top \mathbf{x}_2$
is symmetric. In the case of boosting classifiers,
$\mathbf{w}^\top \mathbf{x}_1$ and $\mathbf{w}^\top \mathbf{x}_2$ can be
expressed as the margins of the positive data and the negative data. Wu et al.
[4] verified empirically that $\mathbf{w}^\top \mathbf{x}$ is approximately
Gaussian for a cascade face detector. We discuss this issue in more detail in
the experiment part.

3 Constructing Boosting Algorithms from LDA and LAC 

In kernel methods, the original data are nonlinearly mapped to a feature space,
and usually the mapping function $\phi(\cdot)$ is not explicitly available. It
works through the inner product $\phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j)$.
In boosting [14], the mapping function can be seen as explicitly known through
$\Phi(\mathbf{x}) : \mathbf{x} \mapsto [h_1(\mathbf{x}), \dots, h_n(\mathbf{x})]$.
Let us consider the Fisher LDA case first, because the solution to LDA will
generalize to LAC straightforwardly, by looking at the similarity between (2)
and (3).

Fisher LDA maximizes the between-class variance and minimizes the within-class
variance. In the binary-class case, we can equivalently rewrite (3) into

$$\max_{\mathbf{w}} \; \frac{(\mu_1 - \mu_2)^2}{\sigma_1 + \sigma_2} = \frac{\mathbf{w}^\top \mathbf{C}_b \mathbf{w}}{\mathbf{w}^\top \mathbf{C}_w \mathbf{w}}, \tag{6}$$

where $\mathbf{C}_b$ and $\mathbf{C}_w$ are the between-class and within-class
scatter matrices; $\mu_1$ and $\mu_2$ are the projected centers of the two
classes. The above problem can be equivalently reformulated as

$$\min_{\mathbf{w}} \; \mathbf{w}^\top \mathbf{C}_w \mathbf{w} - \theta(\mu_1 - \mu_2) \tag{7}$$

for some certain constant $\theta$ and under the assumption that
$\mu_1 - \mu_2 \geq 0$.¹ Now in the feature space, our data are
$\Phi(\mathbf{x}_i)$, $i = 1, \dots, m$. We have

$$\mu_1 = \frac{1}{m_1} \mathbf{w}^\top \sum_{y_i = 1} \Phi(\mathbf{x}_i) = \frac{1}{m_1} \sum_{y_i = 1} \mathbf{A}_{i:} \mathbf{w} = \frac{1}{m_1} \sum_{y_i = 1} (\mathbf{A}\mathbf{w})_i = \mathbf{e}_1^\top \mathbf{A} \mathbf{w}, \tag{8}$$

¹ In our face detection experiment, we found that this assumption could always
be satisfied.
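where $\mathbf{A}_{i:}$ is the $i$-th row of $\mathbf{A}$. Similarly,

$$\mu_2 = \frac{1}{m_2} \sum_{y_i = -1} \mathbf{H}_{i:} \mathbf{w} = -\mathbf{e}_2^\top \mathbf{A} \mathbf{w}. \tag{9}$$

Here the $i$-th entry of $\mathbf{e}_1$ is defined as $e_{1i} = 1/m_1$ if
$y_i = +1$, otherwise $e_{1i} = 0$. Similarly, $e_{2i} = 1/m_2$ if $y_i = -1$,
otherwise $e_{2i} = 0$. We also define $\mathbf{e} = \mathbf{e}_1 + \mathbf{e}_2$.
For ease of exposition, we order the training data according to their labels.
So the vector $\mathbf{e} \in \mathbb{R}^m$ is

$$\mathbf{e} = [1/m_1, \dots, 1/m_2, \dots]^\top, \tag{10}$$

and the first $m_1$ components of the margin vector
$\boldsymbol{\rho} = \mathbf{A}\mathbf{w}$ correspond to the positive training
data and the remaining ones correspond to the $m_2$ negative data. So we have
$\mu_1 - \mu_2 = \mathbf{e}^\top \boldsymbol{\rho}$ and
$\mathbf{C}_w = m_1/m \cdot \boldsymbol{\Sigma}_1 + m_2/m \cdot \boldsymbol{\Sigma}_2$,
with $\boldsymbol{\Sigma}_{1,2}$ the covariance matrices. By noticing that

$$\mathbf{w}^\top \boldsymbol{\Sigma}_{1,2} \mathbf{w} = \frac{1}{m_{1,2}(m_{1,2} - 1)} \sum_{i > k,\; y_i = y_k = \pm 1} (\rho_i - \rho_k)^2,$$

we can easily rewrite the original problem into:

$$\min_{\mathbf{w}, \boldsymbol{\rho}} \; \tfrac{1}{2} \boldsymbol{\rho}^\top \mathbf{Q} \boldsymbol{\rho} - \theta \mathbf{e}^\top \boldsymbol{\rho}, \quad \text{s.t.} \; \mathbf{w} \succcurlyeq \mathbf{0}, \; \mathbf{1}^\top \mathbf{w} = 1, \; \rho_i = (\mathbf{A}\mathbf{w})_i, \; i = 1, \dots, m. \tag{11}$$

Here $\mathbf{Q} = \begin{bmatrix} \mathbf{Q}_1 & \mathbf{0} \\ \mathbf{0} & \mathbf{Q}_2 \end{bmatrix}$
is a block matrix with

$$\mathbf{Q}_1 = \begin{bmatrix} \frac{1}{m} & -\frac{1}{m(m_1 - 1)} & \cdots & -\frac{1}{m(m_1 - 1)} \\ -\frac{1}{m(m_1 - 1)} & \frac{1}{m} & \cdots & -\frac{1}{m(m_1 - 1)} \\ \vdots & \vdots & \ddots & \vdots \\ -\frac{1}{m(m_1 - 1)} & -\frac{1}{m(m_1 - 1)} & \cdots & \frac{1}{m} \end{bmatrix},$$

and $\mathbf{Q}_2$ is similarly defined by replacing $m_1$ with $m_2$ in
$\mathbf{Q}_1$. Also note that we have introduced a constant $\frac{1}{2}$
before the quadratic term for convenience. The normalization constraint
$\mathbf{1}^\top \mathbf{w} = 1$ removes the scale ambiguity of $\mathbf{w}$.
Otherwise the problem is ill-posed.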

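In the case of LAC, the covariance matrix of the negative data is not involved,
which corresponds to the matrix $\mathbf{Q}_2$ being zero. So we can simply set
$\mathbf{Q} = \begin{bmatrix} \mathbf{Q}_1 & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix}$,
and (11) becomes the optimization problem of LAC.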
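The role of $\mathbf{Q}$ can be checked numerically: with $1/m$ on the diagonal of each block and $-1/(m(m_k - 1))$ elsewhere in the block, the quadratic form in the margins should equal the class-weighted sum of projected sample variances, i.e. $\mathbf{w}^\top \mathbf{C}_w \mathbf{w}$. The sketch below verifies this on made-up margins; the helper names are hypothetical, and the `lac` flag simply zeroes the negative block as described above.

```python
from statistics import variance  # unbiased sample variance

def build_block(m_k, m):
    """Q_k: 1/m on the diagonal and -1/(m (m_k - 1)) off the diagonal."""
    off = -1.0 / (m * (m_k - 1))
    return [[1.0 / m if i == j else off for j in range(m_k)]
            for i in range(m_k)]

def build_Q(m1, m2, lac=False):
    """Block-diagonal Q; lac=True drops the negative-class block."""
    m = m1 + m2
    Q = [[0.0] * m for _ in range(m)]
    Q1, Q2 = build_block(m1, m), build_block(m2, m)
    for i in range(m1):
        for j in range(m1):
            Q[i][j] = Q1[i][j]
    if not lac:
        for i in range(m2):
            for j in range(m2):
                Q[m1 + i][m1 + j] = Q2[i][j]
    return Q

def quad_form(Q, rho):
    k = len(rho)
    return sum(rho[i] * Q[i][j] * rho[j] for i in range(k) for j in range(k))

# Toy margins: the first m1 = 3 entries are positives, the last m2 = 4 negatives.
rho = [0.9, 0.4, 0.7, 0.2, -0.1, 0.5, 0.3]
m1, m2 = 3, 4
m = m1 + m2
lhs = quad_form(build_Q(m1, m2), rho)
rhs = m1 / m * variance(rho[:m1]) + m2 / m * variance(rho[m1:])
```

Here `lhs` and `rhs` agree up to rounding, and with `lac=True` only the positive-class variance term remains.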

At this stage, it remains unclear how to solve problem (11), because we do not
know all the weak classifiers. The number of possible weak classifiers could be
infinite, in which case the dimension of the optimization variable $\mathbf{w}$
is infinite. So (11) is a semi-infinite quadratic program (SIQP). We show how
column generation can be used to solve this problem. To make column generation
applicable, we need to derive a specific Lagrange dual of the primal problem.

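The Lagrange dual problem. We now derive the Lagrange dual of the quadratic
problem (11). Although we are only interested in the variable $\mathbf{w}$, we
need to keep the auxiliary variable $\boldsymbol{\rho}$ in order to obtain a
meaningful dual problem. The Lagrangian of (11) is

$$L(\mathbf{w}, \boldsymbol{\rho}, \mathbf{u}, r) = \underbrace{\tfrac{1}{2} \boldsymbol{\rho}^\top \mathbf{Q} \boldsymbol{\rho} - \theta \mathbf{e}^\top \boldsymbol{\rho}}_{\text{primal}} + \underbrace{\mathbf{u}^\top (\boldsymbol{\rho} - \mathbf{A}\mathbf{w}) - \mathbf{q}^\top \mathbf{w} + r(\mathbf{1}^\top \mathbf{w} - 1)}_{\text{dual}},$$

with $\mathbf{q} \succcurlyeq \mathbf{0}$.
$\sup_{\mathbf{u}, r} \inf_{\mathbf{w}, \boldsymbol{\rho}} L(\mathbf{w}, \boldsymbol{\rho}, \mathbf{u}, r)$
gives the following Lagrange dual:

$$\max_{\mathbf{u}, r} \; -r - \underbrace{\tfrac{1}{2} (\mathbf{u} - \theta\mathbf{e})^\top \mathbf{Q}^{-1} (\mathbf{u} - \theta\mathbf{e})}_{\text{regularization}}, \quad \text{s.t.} \; \sum_{i=1}^{m} u_i \mathbf{A}_{i:} \preccurlyeq r \mathbf{1}^\top. \tag{12}$$

In our case, $\mathbf{Q}$ is rank-deficient and its inverse does not exist (for
both LDA and LAC). We can simply regularize $\mathbf{Q}$ with
$\mathbf{Q} + \delta \mathbf{I}$, with $\delta$ a very small constant. One of
the KKT optimality conditions between the dual and primal is
$\boldsymbol{\rho}^* = -\mathbf{Q}^{-1} (\mathbf{u}^* - \theta\mathbf{e})$,
which can be used to establish the connection between the dual optimum and the
primal optimum. This follows from the fact that the gradient of $L$ w.r.t.
$\boldsymbol{\rho}$ must vanish at the optimum:
$\partial L / \partial \rho_i = 0$, $\forall i = 1, \dots, m$.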

Problem (12) can be viewed as a regularized LPBoost problem. Compared
with the hard-margin LPBoost [5], the only difference is the regularization term
in the cost function. The duality gap between the primal (11) and the dual (12)
is zero. In other words, the solutions of (11) and (12) coincide. Instead of
solving (11) directly, one iteratively finds the most violated constraint in
(12) for the current solution and adds this constraint to the optimization
problem. In theory, any column that violates dual feasibility can be added. To
speed up the convergence, we add the most violated constraint by solving the
following problem:

$$h'(\cdot) = \operatorname*{argmax}_{h(\cdot)} \; \sum_{i=1}^{m} u_i y_i h(\mathbf{x}_i). \tag{13}$$

This is exactly the same as the one that standard AdaBoost and LPBoost
use for producing the best weak classifier, that is, finding the weak
classifier that has the minimum weighted training error. We summarize the
LACBoost/FisherBoost algorithm in Algorithm 1. By simply changing
$\mathbf{Q}_2$, Algorithm 1 can be used to train either LACBoost or FisherBoost.
Note that to obtain an actual strong classifier, one may need to include an
offset $b$, i.e., the final classifier is
$\sum_{j=1}^{n} w_j h_j(\mathbf{x}) - b$, because from the cost function of our
algorithm (7) we can see that the cost function itself does not minimize any
classification error. It only finds a projection direction in which the data can
be maximally separated. A simple line search can find an optimal $b$. Moreover,
when training a cascade, we need to tune this offset anyway, as shown in (5).
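The weak classifier selection step, maximizing the edge $\sum_i u_i y_i h(\mathbf{x}_i)$ over the candidate classifiers, can be sketched as follows for a finite pool whose predictions are stored column-wise. The names `edge` and `most_violated` and the toy numbers are hypothetical.

```python
def edge(u, y, predictions):
    """Edge of one weak classifier: sum_i u_i * y_i * h(x_i)."""
    return sum(u_i * y_i * p_i for u_i, y_i, p_i in zip(u, y, predictions))

def most_violated(u, y, H):
    """Index of the column of H with the largest edge, and that edge."""
    n = len(H[0])
    edges = [edge(u, y, [row[j] for row in H]) for j in range(n)]
    best = max(range(n), key=lambda j: edges[j])
    return best, edges[best]

y = [+1, +1, -1, -1]
u = [0.4, 0.1, 0.3, 0.2]      # dual variables acting as example weights
H = [[+1, -1, +1],            # H[i][j] = h_j(x_i)
     [+1, +1, -1],
     [-1, +1, +1],
     [+1, -1, -1]]
j_best, e_best = most_violated(u, y, H)
```

When the weights are non-negative and sum to one, the edge equals one minus twice the weighted error, so maximizing the edge is the same as minimizing the weighted training error.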

The convergence of Algorithm 1 is guaranteed by general column generation 
or cutting-plane algorithms, which is easy to establish. When a new h'{-) that 
violates dual feasibility is added, the new optimal value of the dual problem 
(maximization) would decrease. Accordingly, the optimal value of its primal 
problem decreases too because they have the same optimal value due to zero 
duality gap. Moreover the primal cost function is convex, therefore in the end it 
converges to the global minimum. 
Algorithm 1 Column generation for QP.

Input: Labeled training data $(\mathbf{x}_i, y_i)$, $i = 1, \dots, m$;
    termination threshold $\varepsilon > 0$; regularization parameter $\theta$;
    maximum number of iterations $n_{\max}$.

1 Initialization: $n = 0$; $\mathbf{w} = \mathbf{0}$; and $u_i = \frac{1}{m}$,
    $i = 1, \dots, m$.

2 for iteration = 1 : $n_{\max}$ do

3     Check for the optimality, with $h'(\cdot)$ obtained from (13):
      if iteration > 1 and $\sum_{i=1}^{m} u_i y_i h'(\mathbf{x}_i) < r + \varepsilon$,
      then break; the problem is solved;

4     Add $h'(\cdot)$ to the restricted master problem, which corresponds to a
      new constraint in the dual;

5     Solve the dual problem (12) (or the primal problem (11)) and update $r$
      and $u_i$ ($i = 1, \dots, m$);

6     Increment the number of weak classifiers: $n = n + 1$.

Output: The selected features are $h_1, h_2, \dots, h_n$. The final strong
    classifier is $F(\mathbf{x}) = \sum_{j=1}^{n} w_j h_j(\mathbf{x}) - b$. Here
    the offset $b$ can be learned by a simple search.



At each iteration of column generation, in theory, we can solve either the
dual (12) or the primal problem (11). However, in practice, it could be much
faster to solve the primal problem because (i) generally, the primal problem
has a smaller size and is hence faster to solve. The number of variables of (12)
is $m$ at each iteration, while the number of variables of the primal problem is
the number of iterations. For example, in Viola-Jones' face detection framework,
the number of training data is $m = 10{,}000$ and $n_{\max} = 200$. In other
words, the primal problem has at most 200 variables in this case; (ii) the dual
problem is a standard QP problem. It has no special structure to exploit. As we
will show, the primal problem belongs to a special class of problems and can be
efficiently solved using entropic/exponentiated gradient descent (EG) [15,16].
A fast QP solver is extremely important for training an object detector because
we need to solve a few thousand QP problems.

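We can easily recover both of the dual variables $\mathbf{u}^*$, $r^*$ from the
primal variable $\mathbf{w}^*$ (with
$\boldsymbol{\rho}^* = \mathbf{A}\mathbf{w}^*$):

$$\mathbf{u}^* = -\mathbf{Q} \boldsymbol{\rho}^* + \theta \mathbf{e}; \tag{14}$$

$$r^* = \max_{j = 1, \dots, n} \Big\{ \sum_{i=1}^{m} u_i^* A_{ij} \Big\}. \tag{15}$$

The first equation is the KKT condition
$\boldsymbol{\rho}^* = -\mathbf{Q}^{-1} (\mathbf{u}^* - \theta\mathbf{e})$
solved for $\mathbf{u}^*$. The second equation is obtained from the fact that,
at the optimum, at least one of the dual problem's constraints must hold with
equality. That is to say, $r^*$ is the largest edge over all weak classifiers.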
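A minimal sketch of this recovery step is given below. It takes the sign of $\mathbf{u}^*$ from the KKT condition $\boldsymbol{\rho}^* = -\mathbf{Q}^{-1}(\mathbf{u}^* - \theta\mathbf{e})$, i.e. $\mathbf{u}^* = \theta\mathbf{e} - \mathbf{Q}\boldsymbol{\rho}^*$, and takes $r^*$ as the largest edge over the current columns of $\mathbf{A}$. The function names, the toy matrix and the choice $\mathbf{Q} = \mathbf{I}/m$ are assumptions made for illustration only.

```python
def mat_vec(M, v):
    return [sum(M_ij * v_j for M_ij, v_j in zip(row, v)) for row in M]

def recover_dual(A, Q, e, theta, w):
    """u* = theta * e - Q rho*  and  r* = max_j sum_i u_i* A_ij."""
    rho = mat_vec(A, w)                        # margins rho_i = (A w)_i
    Q_rho = mat_vec(Q, rho)
    u = [theta * e_i - q_i for e_i, q_i in zip(e, Q_rho)]
    r = max(sum(u[i] * A[i][j] for i in range(len(A)))
            for j in range(len(w)))
    return u, r

# Toy problem: m = 4 examples (m1 = m2 = 2), n = 2 selected weak classifiers.
A = [[+1, +1], [+1, -1], [+1, +1], [-1, +1]]
m = 4
Q = [[1.0 / m if i == j else 0.0 for j in range(m)] for i in range(m)]
e = [0.5, 0.5, 0.5, 0.5]                       # 1/m1 and 1/m2 entries
u, r = recover_dual(A, Q, e, theta=0.2, w=[0.75, 0.25])
```

Examples with small or negative margins receive the largest dual values, so they dominate the choice of the next weak classifier.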

We give a brief introduction to the EG algorithm before we proceed. Let us
first define the unit simplex
$\Delta_n = \{\mathbf{w} \in \mathbb{R}^n : \mathbf{1}^\top \mathbf{w} = 1, \mathbf{w} \succcurlyeq \mathbf{0}\}$.
EG efficiently solves the convex optimization problem

$$\min_{\mathbf{w}} \; f(\mathbf{w}), \quad \text{s.t.} \; \mathbf{w} \in \Delta_n, \tag{16}$$

under the assumption that the objective function $f(\cdot)$ is a convex
Lipschitz continuous function with Lipschitz constant $L_f$ w.r.t. a fixed given
norm $\|\cdot\|$. The mathematical definition of $L_f$ is that
$|f(\mathbf{w}) - f(\mathbf{z})| \leq L_f \|\mathbf{w} - \mathbf{z}\|$ holds for
any $\mathbf{w}, \mathbf{z}$ in the domain of $f(\cdot)$. The EG algorithm is
very simple:

1. Initialize with $\mathbf{w}^0 \in$ the interior of $\Delta_n$;

2. Generate the sequence $\{\mathbf{w}^k\}$, $k = 1, 2, \dots$ with

$$w_j^k = \frac{w_j^{k-1} \exp[-\tau_k f'_j(\mathbf{w}^{k-1})]}{\sum_{j=1}^{n} w_j^{k-1} \exp[-\tau_k f'_j(\mathbf{w}^{k-1})]}. \tag{17}$$

   Here $\tau_k$ is the step-size, and
   $f'(\mathbf{w}) = [f'_1(\mathbf{w}), \dots, f'_n(\mathbf{w})]^\top$ is the
   gradient of $f(\cdot)$;

3. Stop if some stopping criteria are met.

The learning step-size can be determined by
$\tau_k = \frac{\sqrt{2 \log n}}{L_f} \frac{1}{\sqrt{k}}$, following [15]. In
[16], the authors have used a simpler strategy to set the learning rate.
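The multiplicative update is short enough to write out directly. The sketch below is a generic illustration, not the authors' C++ implementation: `eg_minimize` is a hypothetical name, the step-size schedule $\sqrt{2 \log n} / (L_f \sqrt{k})$ is assumed, a fixed iteration count replaces a proper stopping criterion, and the toy objective is chosen so that its minimizer on the simplex is known.

```python
import math

def eg_minimize(grad, n, lipschitz, iters=500):
    """Exponentiated gradient descent over the unit simplex."""
    w = [1.0 / n] * n                          # start in the interior
    for k in range(1, iters + 1):
        tau = math.sqrt(2.0 * math.log(n)) / (lipschitz * math.sqrt(k))
        g = grad(w)
        g_min = min(g)   # common shift; cancels in the normalization below
        w = [w_j * math.exp(-tau * (g_j - g_min)) for w_j, g_j in zip(w, g)]
        total = sum(w)
        w = [w_j / total for w_j in w]         # renormalize onto the simplex
    return w

# Toy objective f(w) = 0.5 * ||w - c||^2 with c inside the simplex,
# so the constrained minimizer is c itself.
c = [0.7, 0.2, 0.1]
grad = lambda w: [w_j - c_j for w_j, c_j in zip(w, c)]
w = eg_minimize(grad, n=3, lipschitz=1.0)
```

Because the update is multiplicative and followed by a normalization, the iterates stay strictly inside the simplex without any explicit projection step.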

EG is a very useful tool for solving large-scale convex minimization problems
over the unit simplex. Compared with standard QP solvers like Mosek [17], EG
is much faster. EG makes it possible to train a detector in almost the same
amount of time as with standard AdaBoost, because the majority of the time is
spent on weak classifier training and bootstrapping.

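In the case that $m_1 \gg 1$,

$$\mathbf{Q}_1 = \frac{1}{m} \begin{bmatrix} 1 & -\frac{1}{m_1 - 1} & \cdots & -\frac{1}{m_1 - 1} \\ -\frac{1}{m_1 - 1} & 1 & \cdots & -\frac{1}{m_1 - 1} \\ \vdots & \vdots & \ddots & \vdots \\ -\frac{1}{m_1 - 1} & -\frac{1}{m_1 - 1} & \cdots & 1 \end{bmatrix} \approx \frac{1}{m} \mathbf{I}.$$

Similarly, for LDA, $\mathbf{Q}_2 \approx \frac{1}{m} \mathbf{I}$ when
$m_2 \gg 1$. Hence,

$$\mathbf{Q} \approx \begin{cases} \frac{1}{m} \mathbf{I}, & \text{for Fisher LDA}, \\ \frac{1}{m} \begin{bmatrix} \mathbf{I} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix}, & \text{for LAC}. \end{cases} \tag{18}$$

Therefore, the problems involved can be simplified when $m_1 \gg 1$ and
$m_2 \gg 1$ hold. The primal problem (11) equals

$$\min_{\mathbf{w}} \; \tfrac{1}{2} \mathbf{w}^\top (\mathbf{A}^\top \mathbf{Q} \mathbf{A}) \mathbf{w} - (\theta \mathbf{e}^\top \mathbf{A}) \mathbf{w}, \quad \text{s.t.} \; \mathbf{w} \in \Delta_n. \tag{19}$$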

We can efficiently solve (19) using the EG method. In EG there is an important
parameter $L_f$, which is used to determine the step-size. $L_f$ can be
determined by the $\ell_\infty$-norm of $|f'(\mathbf{w})|$. In our case
$f'(\mathbf{w})$ is a linear function, which is trivial to compute. The
convergence of EG is guaranteed; see [15] for details.

In summary, when using EG to solve the primal problem, Line 5 of Algorithm 1 is:

- Solve the primal problem (19) using EG, and update the dual variables
  $\mathbf{u}$ with (14), and $r$ with (15).
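Putting the pieces together, the following toy sketch runs the column generation loop with EG as the inner solver. It is an illustration under several assumptions rather than the authors' implementation: the FisherBoost approximation $\mathbf{Q} \approx \mathbf{I}/m$ is used, the weak classifiers are hypothetical one-dimensional stumps drawn from a tiny finite pool, the dual variables are recovered as $\mathbf{u} = \theta\mathbf{e} - \mathbf{Q}\boldsymbol{\rho}$, the Lipschitz constant is replaced by a simple upper bound on the gradient's sup-norm, and EG runs for a fixed number of iterations.

```python
import math

def eg_solve(A, q_diag, e, theta, iters=300):
    """Approximately solve the simplex-constrained QP with EG (diagonal Q)."""
    m, n = len(A), len(A[0])
    # Upper bound on the gradient's sup-norm (entries of A are +/-1).
    L_f = sum(q_diag[i] + theta * e[i] for i in range(m))
    w = [1.0 / n] * n
    for k in range(1, iters + 1):
        rho = [sum(A[i][j] * w[j] for j in range(n)) for i in range(m)]
        resid = [q_diag[i] * rho[i] - theta * e[i] for i in range(m)]
        g = [sum(A[i][j] * resid[i] for i in range(m)) for j in range(n)]
        tau = math.sqrt(2.0 * math.log(max(n, 2))) / (L_f * math.sqrt(k))
        g_min = min(g)
        w = [w[j] * math.exp(-tau * (g[j] - g_min)) for j in range(n)]
        total = sum(w)
        w = [x / total for x in w]
    return w

def column_generation(X, y, pool, theta=0.1, eps=1e-6, n_max=10):
    m = len(y)
    m1 = sum(1 for v in y if v > 0)
    e = [1.0 / m1 if v > 0 else 1.0 / (m - m1) for v in y]
    q_diag = [1.0 / m] * m                     # assumed: Q ~ I/m (FisherBoost)
    u, r = [1.0 / m] * m, 0.0
    chosen, A, w = [], [[] for _ in range(m)], []
    for it in range(n_max):
        # Weak classifier with the largest edge under the current u.
        edges = [sum(u[i] * y[i] * h(X[i]) for i in range(m)) for h in pool]
        j = max(range(len(pool)), key=lambda t: edges[t])
        if it > 0 and edges[j] < r + eps:
            break                              # no violated dual constraint
        chosen.append(j)
        for i in range(m):
            A[i].append(y[i] * pool[j](X[i]))  # new column of A
        w = eg_solve(A, q_diag, e, theta)      # restricted primal problem
        rho = [sum(a * b for a, b in zip(A[i], w)) for i in range(m)]
        u = [theta * e[i] - q_diag[i] * rho[i] for i in range(m)]
        r = max(sum(u[i] * A[i][t] for i in range(m)) for t in range(len(w)))
    return chosen, w

def stump(threshold):
    """Hypothetical weak classifier on a scalar: +1 above the threshold."""
    return lambda x: 1 if x > threshold else -1

X = [0.9, 0.8, 0.7, 0.45, 0.6, 0.3, 0.2, 0.1]
y = [+1, +1, +1, +1, -1, -1, -1, -1]
pool = [stump(t) for t in (0.25, 0.5, 0.65)]
chosen, w = column_generation(X, y, pool)
```

With a finite pool the loop stops as soon as the best remaining edge no longer exceeds $r$, at which point `w` holds the simplex-constrained weights of the selected weak classifiers; an offset $b$ would still have to be found by a separate search.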
Fig. 1. Decision boundaries of AdaBoost (left) and FisherBoost (right) on 2D
artificial data (positive data represented by squares and negative data by
crosses). Weak classifiers are decision stumps. In this case, FisherBoost tends
to correctly classify more positive data.

Fig. 2. Normality test (normal probability plot) for the face data's margin
distribution of nodes 1, 2, 3. The 3 nodes contain 7, 22, 52 weak classifiers
respectively. Curves close to a straight line mean close to a Gaussian.



4 Applications to Face Detection 

First, let us show a simple example on a synthetic dataset (more negative data 
than positive data) to illustrate the difference between FisherBoost and Ad- 
aBoost. Fig. 1 demonstrates the subtle difference of the classification boundaries 
obtained by AdaBoost and FisherBoost. We can see that FisherBoost seems to 
focus more on correctly classifying positive data points. This might be due to 
the fact that AdaBoost only optimizes the overall classification accuracy. This 
finding is consistent with the result in [18]. 

Face detection. In this section, we compare our algorithm with other
state-of-the-art face detectors. We first show some results about the validity
of LAC (or Fisher LDA) post-processing for improving node learning in object
detection. Fig. 2 illustrates the normal probability plot of margins of positive
training data, for the first three nodes in the multi-exit cascade with LAC.
Clearly, the larger the number of weak classifiers used, the more closely the
margin follows a Gaussian distribution. In other words, LAC may achieve a better
performance if a larger number of weak classifiers are used. The performance
could be poor with too few weak classifiers. The same statement applies to
Fisher LDA, LACBoost, and FisherBoost. Therefore, we do not apply LAC/LDA in the
first eight nodes because the margin distribution could be far from a Gaussian
distribution. Because the late nodes of a multi-exit cascade contain more weak
classifiers, we conjecture that the multi-exit cascade might meet the
Gaussianity requirement better. We have compared multi-exit cascades with
LDA/LAC post-processing against the standard cascades with LDA/LAC
post-processing in [4], and slightly improved performances were obtained.

Six methods are evaluated with the multi-exit cascade framework [10]:
AdaBoost with LAC post-processing or LDA post-processing, AsymBoost
with LAC or LDA post-processing [4], and our FisherBoost and LACBoost. We have
also implemented Viola-Jones' face detector as the baseline [2]. As in [2], five
basic types of Haar-like features are calculated, which make up a
162,336-dimensional over-complete feature set on an image of 24 × 24 pixels. To
speed up the weak classifier training, as in [4], we uniformly sample 10% of
features for training weak classifiers (decision stumps). The training data are
9,832 mirrored 24 × 24 face images (5,000 for training and 4,832 for validation)
and 7,323 large background images, which are the same as in [4].

Multi-exit cascades with 22 exits and 2,923 weak classifiers are trained with
various methods. For fair comparisons, we have used the same cascade structure
and the same number of weak classifiers for all the compared learning methods.
The indexes of exits are pre-set to simplify the training procedure. For our
FisherBoost and LACBoost, we have an important parameter $\theta$, which is
chosen from $\{\frac{1}{10}, \frac{1}{12}, \frac{1}{15}, \frac{1}{20}, \frac{1}{25}, \frac{1}{30}, \frac{1}{40}, \frac{1}{50}\}$.
We have not carefully tuned this parameter using cross-validation. Instead, we
train a 10-node cascade for each candidate $\theta$, and choose the one with the
best training accuracy.² At each exit, negative examples misclassified by the
current cascade are discarded, and new negative examples are bootstrapped from
the background image pool. In total, billions of negative examples are extracted
from the pool. The positive training data and validation data remain unchanged
during the training process.

Our experiments are performed on a workstation with 8 Intel Xeon E5520
CPUs and 32GB RAM. It takes about 3 hours to train the multi-exit cascade
with AdaBoost or AsymBoost. For FisherBoost and LACBoost, it takes less than
4 hours to train a complete multi-exit cascade.³ In other words, our EG
algorithm takes less than 1 hour for solving the primal QP problem (we need to
solve a QP at each iteration). A rough estimation of the computational
complexity is as follows. Suppose that the number of training examples is $m$
and the number of weak classifiers is $n$. At each iteration of the cascade
training, the complexity of solving the primal QP using EG is $O(mn + kn^2)$,
with $k$ the number of iterations needed for EG's convergence. The complexity of
training the weak classifier is $O(md)$, with $d$ the number of all Haar-feature
patterns. In our experiment, $m = 10{,}000$, $n \approx 2900$,
$d = 160{,}000$, $k < 500$. So the majority of the training computation is on
the weak classifier training.

² To train a complete 22-node cascade and choose the best $\theta$ on
cross-validation data may give better detection rates.

³ Our implementation is in C++ and only the weak classifier training part is
parallelized using OpenMP.

We have also experimentally observed the speedup of EG against standard
QP solvers. We solve the primal QP defined by (19) using EG and Mosek [17].
The QP's size is 1,000 variables. With the same accuracy tolerance (Mosek's
primal-dual gap and EG's convergence tolerance are set to the same value),
Mosek takes 1.22 seconds and EG takes 0.0541 seconds. So EG is about 20 times
faster. Moreover, at iteration $n + 1$ of training the cascade, EG can take
advantage of the last iteration's solution by starting from a small perturbation
of the previous solution. Such a warm-start gains a 5 to 10× speedup in our
experiment, while there are no off-the-shelf warm-start QP solvers available
yet.

We evaluate the detection performance on the MIT+CMU frontal face test
set. Two performance metrics are used here: one for each node and one for the
entire cascade. The node metric measures how well the classifiers meet the node
learning objective. It provides useful information about the capability of each
method to achieve the node learning goal. The cascade metric uses the receiver
operating characteristic (ROC) to compare the entire cascade's performance.
Multiple factors have an impact on the cascade's performance: classifiers, the
cascade structure, bootstrapping, etc.

We show the node comparison results in Fig. 3. The node performances 
between FisherBoost and LACBoost are very similar. From Fig. 3, as reported 
in [4], LDA or LAC post-processing can considerably reduce the false negative 
rates. As expected, our proposed FisherBoost and LACBoost can further reduce 
the false negative rates significantly. This verifies the advantage of selecting 
features with the node learning goal being considered. 

From the ROC curves in Fig. 4, we can see that FisherBoost and LACBoost
outperform all the other methods. In contrast to the results of the detection
rate for each node, LACBoost is slightly worse than FisherBoost in some cases.
That might be because many factors have an impact on the final detection result.
LAC makes the assumption of Gaussian and symmetric data distributions,
which may not hold well in the early nodes. This could explain why LACBoost
does not always perform the best. Wu et al. have observed the same phenomenon
that LAC post-processing does not outperform LDA post-processing in a few
cases. However, we believe that for harder detection tasks, the benefits of
LACBoost would be more impressive.

The error reduction results of FisherBoost and LACBoost in Fig. 4 are not
as great as those in Fig. 3. This might be explained by the fact that the
cascade and negative data bootstrapping remove some of the error-reducing
effects, to some extent. We have also compared our methods with the boosted
greedy sparse LDA (BGSLDA) in [18], which is considered one of the
state-of-the-art methods. We provide the ROC curves in the supplementary
package. Both of our methods outperform BGSLDA with AdaBoost/AsymBoost by about
2% in the detection rate. Note that BGSLDA uses the standard cascade. So besides
the benefits of our FisherBoost/LACBoost, the multi-exit cascade also
contributes to the improvement.

Fig. 3. Node performances on the validation data. "Ada" means that features are
selected using AdaBoost; "Asym" means that features are selected using AsymBoost.



5 Conclusion 

By explicitly taking into account the node learning goal in cascade classifiers,
we have designed new boosting algorithms for more effective object detection.
Experiments validate the superiority of our FisherBoost and LACBoost. We have
also proposed to use entropic gradient to efficiently implement FisherBoost and
LACBoost. The proposed algorithms are easy to implement and can be applied
to other asymmetric classification tasks in computer vision. We are also trying
to design new asymmetric boosting algorithms by looking at asymmetric
kernel classification methods.